Exploring White Wine Dataset with R by Li Chang
As the final project for Udacity's Data Analysis with R course, I decided to educate myself about white wine. I downloaded the dataset from here and loaded it into R. The research question is which physicochemical properties affect taste preference.
I - Univariate Plots Section
Below I examine each variable in the dataset. For each one I started with a basic plot and then revised it to be clearer and more user-friendly. This helped me understand the distribution of each variable and decide whether anything needed tidying up.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
quality
In our dataset, the “quality” variable ranges from 3 to 9, so no wine is rated as very bad or truly excellent. Only 25 wines are rated either 3 or 9 (20 and 5, respectively), and I exclude these 25 cases from the bivariate section onward. Although quality is stored as an integer, it makes more sense as an ordinal variable, so that I can compare the physicochemical properties across quality levels.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
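A minimal sketch of the tabulation and filtering described above, assuming the data frame is called `wine` (the name used in the random forest call later in this report). The data frame here is rebuilt from the printed counts instead of being read from the original file, so only the `quality` column is real; `wine_mid` is a hypothetical name for the filtered data.

```r
# Rebuild just the quality column from the counts printed above
# (a stand-in for the real data frame, for illustration only).
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
wine <- data.frame(quality = rep(3:9, times = counts))

# Frequency of each rating
quality_counts <- table(wine$quality)
print(quality_counts)

# Drop the sparse extremes (ratings 3 and 9) for the later sections
wine_mid <- subset(wine, quality > 3 & quality < 9)

# Treat the rating as an ordered category instead of a number
wine_mid$quality.cat <- factor(wine_mid$quality, ordered = TRUE)
nrow(wine_mid)
```

Dropping the 25 extreme cases leaves 4873 rows, which agrees with the group sizes printed in the multivariate section.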

fixed acidity
The basic histogram shows that fixed acidity has very few values at the low end (the minimum is 3.8) and a long tail above 10, so I limit the x-axis range. Changing the bin width also shows more clearly that the majority of fixed acidity values fall between 5.5 and 8.5.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
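The plotting code is not shown in this report, but the `stat_bin` messages suggest ggplot2 histograms. A rough sketch of the revised plot, using synthetic values in place of the real `fixed.acidity` column; the bin width and axis limits are illustrative choices, not the ones actually used.

```r
library(ggplot2)

# Synthetic stand-in for the fixed.acidity column (not the real data)
set.seed(1)
wine <- data.frame(
  fixed.acidity = round(rnorm(500, mean = 6.9, sd = 0.85), 1)
)

# An explicit binwidth avoids the "binwidth defaulted to range/30" message;
# coord_cartesian() zooms the x axis without discarding any rows.
p <- ggplot(wine, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.1, colour = "white") +
  coord_cartesian(xlim = c(4, 10)) +
  labs(x = "Fixed acidity", y = "Number of wines")
print(p)
```

Using `coord_cartesian()` instead of `xlim()` keeps the out-of-range wines in the bin counts and only changes what is displayed.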

volatile acidity
After adjusting the bin width, I can see that most wines have a volatile acidity (acetic acid) level between 0.15 and 0.4 g/liter. A high level of acetic acid causes an unpleasant vinegar taste and therefore a poor sensory rating. I test this in the next section.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

citric acid
Most citric acid levels fall between 0.15 and 0.5 g/litre, with a spike at 0.49 g/litre. In contrast to volatile acidity, citric acid adds freshness to the wine.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

residual sugar
Residual sugar has a wide range, from 0.6 to 65.8 g/litre. This is presumably because wine producers cater to consumers’ varying preferences for sweetness. Some people, like me, favor sweet wines, while others prefer bone dry.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

chlorides
Most wines have between 0.025 and 0.06 g/liter of sodium chloride, with a mean of 0.046 g/liter and a median of 0.043 g/liter. The highest level in this dataset is 0.346 g/liter.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

free sulfur dioxide
Free sulfur dioxide has a wide range, from 2 to 289 mg/liter, with the majority of values falling between 10 and 55 mg/liter. Since free sulfur dioxide becomes noticeable at 50 mg/l, I assume it will affect the taste.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

total sulfur dioxide
Similar to free sulfur dioxide, total sulfur dioxide also has a wide range from 9 to 440mg/liter.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

density
Density has a narrow range, from 0.99 to 1.04 g/cc. It depends mostly on the percentage of alcohol and sugar in the wine.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

pH
pH has a narrow range, from 2.7 to 3.8, which is quite acidic!
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect

alcohol
The majority of alcohol values fall between 9% and 13%. An appropriate level of alcohol enhances the flavor, but a high level causes an unpleasant burning sensation.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

Univariate Analysis
The dataset has a total of 4898 observations and 13 variables. Almost all variables are numeric.
The majority of white wines in this dataset are rated between 4 and 8 on a scale of 10, so we don’t have very bad or truly excellent wines. Quality is the main feature I am interested in: I wonder which physicochemical properties influence taste preference. However, only 20 wines are rated 3 and only 5 are rated 9. I think it’s better to drop these few cases before the bivariate and multivariate analysis so that we can focus on the bulk of the data. Other than that, the dataset is relatively clean.
Residual sugar has a wide range, from 0.6 to 65.8 g/litre, supposedly to accommodate customers’ varying palates for sweetness.
Free sulfur dioxide ranges from 2 to 289 mg/liter, but be aware that the smell becomes noticeable above 50 mg/l.
White wine is quite acidic, with pH ranging from 2.7 to 3.8.
Below I use a random forest to examine which factors are likely to affect wine quality most. The results are much as expected. Alcohol comes out as the most important factor. Density essentially depends on the alcohol and sugar content of the wine (as shown in the Bivariate Plots section), so it is no surprise that it was also given high importance. Volatile acidity and free sulfur dioxide follow closely after. Both have such “volatile” properties that an excessive amount causes an unpleasant smell or taste. So alcohol, volatile acidity, and free sulfur dioxide are the main features of interest for further investigation.
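library(randomForest)

# quality.cat is the quality rating converted to a factor,
# so randomForest() fits a classification forest.
fit <- randomForest(quality.cat ~ fixed.acidity +
                      volatile.acidity +
                      citric.acid +
                      residual.sugar +
                      chlorides +
                      free.sulfur.dioxide +
                      total.sulfur.dioxide +
                      density +
                      pH +
                      sulphates +
                      alcohol, data = wine)
print(fit) # view results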
##
## Call:
## randomForest(formula = quality.cat ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = wine)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 28.22%
## Confusion matrix:
## 3 4 5 6 7 8 9 class.error
## 3 0 0 8 12 0 0 0 1.0000000
## 4 0 39 75 47 2 0 0 0.7607362
## 5 0 8 1050 386 13 0 0 0.2793411
## 6 0 4 248 1827 117 2 0 0.1687898
## 7 0 0 14 341 519 6 0 0.4102273
## 8 0 0 1 50 43 81 0 0.5371429
## 9 0 0 0 3 2 0 0 1.0000000
importance <- importance(fit) # importance of each predictor
data.matrix(importance[ order( -importance[,1]), ]) #order by feature importance
## [,1]
## alcohol 383.8287
## density 338.6329
## volatile.acidity 330.6913
## free.sulfur.dioxide 311.0399
## total.sulfur.dioxide 304.4219
## residual.sugar 292.5607
## pH 285.4842
## chlorides 281.5678
## citric.acid 266.3554
## sulphates 265.3936
## fixed.acidity 246.5528
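As a sanity check on the random forest printout above, the OOB error rate and the per-class errors can be recomputed from the confusion matrix alone. The sketch below copies the matrix by hand; `conf`, `oob_error`, and `class_error` are names introduced here for illustration.

```r
# Confusion matrix copied from the randomForest printout above
# (rows = actual rating, columns = predicted rating).
ratings <- as.character(3:9)
conf <- matrix(
  c(0,  0,    8,   12,   0,  0, 0,
    0, 39,   75,   47,   2,  0, 0,
    0,  8, 1050,  386,  13,  0, 0,
    0,  4,  248, 1827, 117,  2, 0,
    0,  0,   14,  341, 519,  6, 0,
    0,  0,    1,   50,  43, 81, 0,
    0,  0,    0,    3,   2,  0, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = ratings, predicted = ratings)
)

# Overall OOB error: share of wines that fall off the diagonal
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: share of each actual rating that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

round(100 * oob_error, 2)   # 28.22, matching the printout
round(class_error, 4)
```

The per-class errors make the weakness at the extremes explicit: none of the wines rated 3 or 9 is classified correctly, which is one more argument for dropping those 25 cases.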
Quality is converted from integer to factor (quality.cat). I also used the 50 mg/l threshold to create a new variable, “free.sulfur.dioxide.cat”, with the levels “noticeable” (free.sulfur.dioxide > 50) and “not noticeable” (free.sulfur.dioxide <= 50).
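The recoding just described could look roughly like this. The rows below are made up for illustration, and the exact calls used in the original analysis are not shown in this report, so treat this as one possible way to build the two variables.

```r
# Small made-up sample, for illustration only
wine <- data.frame(
  quality = c(5L, 6L, 7L, 6L, 4L),
  free.sulfur.dioxide = c(45, 14, 50, 62, 130)
)

# Integer rating -> factor, so each rating is treated as its own class
wine$quality.cat <- factor(wine$quality)

# Flag wines above the 50 mg/l threshold; exactly 50 counts as "not noticeable"
wine$free.sulfur.dioxide.cat <- factor(
  ifelse(wine$free.sulfur.dioxide > 50, "noticeable", "not noticeable"),
  levels = c("not noticeable", "noticeable")
)

table(wine$free.sulfur.dioxide.cat)
```

Setting `levels` explicitly fixes the order in which the two groups appear in tables and plot legends.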
Looking through the summary data above, residual.sugar, free.sulfur.dioxide, and total.sulfur.dioxide have wide ranges. In particular, wines with more than 45 grams of residual sugar per litre are considered very sweet. However, I don’t think these wide distributions should be trimmed, since some wines legitimately have extreme characteristics.
II - Bivariate Plots and Analysis
Below I use a correlation matrix to visualize the pairwise correlations between variables. The only moderate positive correlation involving quality is with alcohol (0.44).
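A minimal sketch of how such a correlation matrix can be computed with base R's `cor()`. The data here are synthetic, so the coefficients will not match the 0.44 reported above, and the package used to draw the original matrix plot is not known.

```r
# Synthetic stand-in data (not the real wine measurements)
set.seed(42)
n <- 200
alcohol <- runif(n, 8, 14)
wine <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001),
  quality = round(3 + 0.3 * alcohol + rnorm(n))
)

# Pairwise Pearson correlations between all numeric columns
cor_matrix <- cor(wine, method = "pearson")
round(cor_matrix, 2)

# Correlation of every variable with quality, strongest first
sort(cor_matrix[, "quality"], decreasing = TRUE)
```

Sorting the `quality` column is a quick way to see which inputs are worth plotting first.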

Main Features of Interest
Both the correlation matrix and the random forest feature importance show that alcohol has the strongest relationship with wine rating (correlation = 0.44).
Interestingly, the relationship between alcohol and rating doesn’t seem to be linearly positive across the whole range. For good wines (quality rating above 5), the higher the alcohol level, the better the rating. This leads me to think that other factors must be causing this parabola-shaped curve between alcohol and quality rating.

Alcohol level reinforces acidity.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Free sulfur dioxide decreases when alcohol level rises.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 13 rows containing missing values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

Consistent with the correlation matrix, charting free sulfur dioxide and volatile acidity against quality rating doesn’t show a strong relationship, even though the random forest placed both among the top four features. The relationship could be masked by other factors.
Other interesting features
Density depends on the percentage of alcohol in the water. The plot below clearly shows that the density decreases when the amount of alcohol increases.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
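A rough sketch of a scatter plot with a fitted trend line for density against alcohol, on synthetic data. Note the swap: the original plots used ggplot2's automatic smoother (a GAM, per the message above), while this example names a plain linear fit explicitly.

```r
library(ggplot2)

# Synthetic stand-in data (not the real wine measurements)
set.seed(7)
alcohol <- runif(300, 8, 14)
wine <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(300, sd = 0.001)
)

# Naming the smoothing method explicitly avoids the
# automatic method selection message from geom_smooth().
p <- ggplot(wine, aes(x = alcohol, y = density)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Alcohol (%)", y = "Density (g/cc)")
print(p)
```

The low `alpha` helps with overplotting, which matters when several thousand wines share a narrow density range.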

Additionally, density increases with sugar content. So density shouldn’t be included alongside alcohol or sugar in any model, in order to minimize multicollinearity.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
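One way to put a number on the multicollinearity concern is a variance inflation factor (VIF), computed by regressing density on the other two predictors. This check is not part of the original analysis; the data are synthetic and built so that density depends on alcohol and sugar, as the plots suggest.

```r
# Synthetic stand-in data in which density is driven by alcohol and sugar
set.seed(3)
n <- 300
alcohol <- runif(n, 8, 14)
residual.sugar <- runif(n, 1, 20)
density <- 1.005 - 0.0012 * alcohol + 0.0004 * residual.sugar +
  rnorm(n, sd = 0.0005)
wine <- data.frame(alcohol, residual.sugar, density)

# Variance inflation factor for density: regress it on the other
# predictors and compute 1 / (1 - R^2). Large values signal collinearity.
r2 <- summary(lm(density ~ alcohol + residual.sugar, data = wine))$r.squared
vif_density <- 1 / (1 - r2)
vif_density
```

A common rule of thumb treats a VIF above 5 or 10 as a sign that the variable is largely redundant given the others.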

It’s said that when free sulfur dioxide exceeds 50 mg/l, you can smell sulfur in the wine. There is an inflection point where free sulfur dioxide reaches 50 mg/l.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Multivariate Plots and Analysis
Within the same quality rating, a wine without a noticeable sulfur smell is likely to have a higher alcohol level. In other words, holding alcohol constant, you are likely to get a better-rated wine if the sulfur level is unnoticeable. Also, for people concerned about the health effects of sulfur, choosing a wine with a higher alcohol concentration would likely reduce sulfur intake (if alcohol is less of a concern to them, that is).
When I break down quality by alcohol level and volatile acidity, the negative relationship between volatile acidity and quality is strongest in the lowest alcohol band, (7.99, 10.1].
## (7.99,10.1] (10.1,12.1] (12.1,14.2]
## 2078 2143 652
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
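The interval labels in the output above look like what base R's `cut()` produces when asked for three equal-width bins over the alcohol range of 8 to 14.2. That is an assumption about how the bands were made, sketched here with made-up alcohol values that span the same range.

```r
# Made-up alcohol values spanning the same range as the dataset (8 to 14.2)
alcohol <- c(8.0, 8.8, 9.5, 9.9, 10.4, 11.0, 11.4, 12.5, 13.1, 14.2)

# cut() with a single number splits the observed range into that many
# equal-width intervals and labels them with rounded endpoints.
alcohol.cat <- cut(alcohol, breaks = 3)

levels(alcohol.cat)
table(alcohol.cat)
```

The lower bound prints as 7.99 because `cut()` widens the range slightly so that the minimum value falls inside the first interval.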

The boxplot below shows that the negative relationship between wine quality and total sulfur dioxide is stronger among wines with free sulfur dioxide above 50 mg/l.
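A sketch of a boxplot split this way, with one panel per sulfur category. The data are synthetic and the layout is a guess at the original figure, so the pattern it draws is not meant to reproduce the real one.

```r
library(ggplot2)

# Synthetic stand-in data (not the real wine measurements)
set.seed(11)
n <- 400
wine <- data.frame(
  quality = sample(4:8, n, replace = TRUE),
  free.sulfur.dioxide = runif(n, 2, 100)
)
wine$total.sulfur.dioxide <- 60 + 1.5 * wine$free.sulfur.dioxide +
  rnorm(n, sd = 20)
wine$quality.cat <- factor(wine$quality)
wine$free.sulfur.dioxide.cat <- factor(
  ifelse(wine$free.sulfur.dioxide > 50, "noticeable", "not noticeable"),
  levels = c("not noticeable", "noticeable")
)

# One panel per sulfur category, one box per quality rating
p <- ggplot(wine, aes(x = quality.cat, y = total.sulfur.dioxide)) +
  geom_boxplot() +
  facet_wrap(~ free.sulfur.dioxide.cat) +
  labs(x = "Quality rating", y = "Total sulfur dioxide (mg/l)")
print(p)
```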

Final Plots and Summary
Plot One - Quality vs. Alcohol and Free Sulfur Dioxide

Description One
Following this great visualization example, I was able to combine three graphs into one. The charts on the top and right show the distributions of quality rating and alcohol, respectively, split by whether the free sulfur dioxide is noticeable. As you can see, most wines cluster between 5 and 7 (okay wines). Wines with no noticeable free sulfur dioxide have slightly higher ratings, between 6 and 7. Those with free sulfur dioxide above 50 mg/l have lower alcohol concentrations, and this pattern holds regardless of quality rating (middle graph).
Plot Two
## Warning: Removed 68 rows containing missing values (stat_smooth).
## Warning: Removed 68 rows containing missing values (geom_point).
## Warning: Removed 65 rows containing missing values (stat_smooth).
## Warning: Removed 65 rows containing missing values (geom_point).

Description Two
The two combined charts plot alcohol against free sulfur dioxide and volatile acidity. The higher the alcohol level, the lower the free sulfur dioxide but the higher the acidity. This may be because alcohol can mask the unpleasant odor and enhance acidity.
Plot Three

Description Three
The boxplot shows that the negative relationship between wine quality and total sulfur dioxide is stronger among wines with free sulfur dioxide above 50 mg/l, where the unpleasant smell of sulfur really hurts the taste.
Reflection
My learnings after exploring the white wine dataset:
Alcohol is the most important factor that affects the taste of the wines.
Alcohol can suppress the unpleasant odor and enhance acidity.
Free sulfur dioxide is really critical. A level above 50 mg/l is excessive and makes the unpleasant smell noticeable, which hurts the taste.
This white wine dataset is the tidiest one I’ve used for Udacity projects. However, I was frustrated in the beginning because, except for alcohol, almost none of the input variables has a strong relationship with wine quality. Reading the correlation matrix is not enough: once I conditioned on other relevant variables, the relationships between the physicochemical properties and quality became clearer. Also, all input variables are continuous, which limited the types of graphs I could make. One solution was to recode some of them as categorical variables.
The other problem was that my knowledge of the physicochemical properties and how they interact was limited before starting this project. I had to do additional reading to brush up on my wine knowledge.
This dataset is fairly limited, with only 13 variables (11 of them physicochemical inputs). It would be great if other variables, such as grape type and wine age, could be included for further investigation.